Results 1 - 20 of 55
1.
Proteomes ; 5(1)2017 Feb 03.
Article in English | MEDLINE | ID: mdl-28248256

ABSTRACT

Medulloblastoma (MB) is the most common malignant pediatric brain tumor. Patient survival has remained largely the same for the past 20 years, with therapies causing significant health, cognitive, behavioral and developmental complications for those who survive the tumor. In this study, we profiled the total transcriptome and proteome of two established MB cell lines, Daoy and UW228, using high-throughput RNA sequencing (RNA-Seq) and label-free nano-LC-MS/MS-based quantitative proteomics, coupled with advanced pathway analysis. While Daoy has been suggested to belong to the sonic hedgehog (SHH) subtype, the exact UW228 subtype is not yet clearly established. Thus, a goal of this study was to identify protein markers and pathways that would help elucidate their subtype classification. A number of differentially expressed genes and proteins, including a number of adhesion, cytoskeletal and signaling molecules, were observed between the two cell lines. While several cancer-associated genes/proteins exhibited similar expression across the two cell lines, upregulation of a number of signature proteins and enrichment of key components of SHH and WNT signaling pathways were uniquely observed in Daoy and UW228, respectively. The novel information on differentially expressed genes/proteins and enriched pathways provides insights into the biology of MB, which could help elucidate their subtype classification.

2.
Big Data ; 4(1): 60-6, 2016 03.
Article in English | MEDLINE | ID: mdl-27441585

ABSTRACT

This case study evaluates and tracks vitality of a city (Seattle), based on a data-driven approach, using strategic, robust, and sustainable metrics. This case study was collaboratively conducted by the Downtown Seattle Association (DSA) and CDO Analytics teams. The DSA is a nonprofit organization focused on making the city of Seattle and its Downtown a healthy and vibrant place to Live, Work, Shop, and Play. DSA primarily operates through public policy advocacy, community and business development, and marketing. In 2010, the organization turned to CDO Analytics ( cdoanalytics.org ) to develop a process that can guide and strategically focus DSA efforts and resources for maximal benefit to the city of Seattle and its Downtown. CDO Analytics was asked to develop clear, easily understood, and robust metrics for a baseline evaluation of the health of the city, as well as for ongoing monitoring and comparisons of the vitality, sustainability, and growth. The DSA and CDO Analytics teams strategized on how to effectively assess and track the vitality of Seattle and its Downtown. The two teams filtered a variety of data sources, and evaluated the veracity of multiple diverse metrics. This iterative process resulted in the development of a small number of strategic, simple, reliable, and sustainable metrics across four pillars of activity: Live, Work, Shop, and Play. Data during the 5 years before 2010 were used for the development of the metrics and model and its training, and data during the 5 years from 2010 and on were used for testing and validation. This work enabled DSA to routinely track these strategic metrics, use them to monitor the vitality of Downtown Seattle, prioritize improvements, and identify new value-added programs. As a result, the four-pillar approach became an integral part of the data-driven decision-making and execution of the Seattle community's improvement activities. 
The approach described in this case study is actionable, robust, inexpensive, and easy to adopt and sustain. It can be applied to cities, districts, counties, regions, states, or countries, enabling cross-comparisons and improvements of vitality, sustainability, and growth.


Subjects
City Planning/methods , Organizational Case Studies , Humans , Machine Learning , Washington
3.
Stem Cells Int ; 2016: 6183562, 2016.
Article in English | MEDLINE | ID: mdl-26681951

ABSTRACT

Current approaches in human embryonic stem cell (hESC) to pancreatic beta cell differentiation have largely been based on knowledge gained from developmental studies of the epithelial pancreas, while the potential roles of other supporting tissue compartments have not been fully explored. One such tissue is the pancreatic mesenchyme that supports epithelial organogenesis throughout embryogenesis. We hypothesized that detailed characterization of the pancreatic mesenchyme might result in the identification of novel factors not used in current differentiation protocols. Supplementing existing hESC differentiation conditions with such factors might create a more comprehensive simulation of normal development in cell culture. To validate our hypothesis, we took advantage of a novel transgenic mouse model to isolate the pancreatic mesenchyme at distinct embryonic and postnatal stages for subsequent proteomic analysis. Refined sample preparation and analysis conditions across four embryonic and prenatal time points resulted in the identification of 21,498 peptides with high-confidence mapping to 1,502 proteins. Expression analysis of pancreata confirmed the presence of three potentially important factors in cell differentiation: Galectin-1 (LGALS1), Neuroplastin (NPTN), and the Laminin α-2 subunit (LAMA2). Two of the three factors (LGALS1 and LAMA2) increased expression of pancreatic progenitor transcript levels in a published hESC to beta cell differentiation protocol. In addition, LAMA2 partially blocks cell culture induced beta cell dedifferentiation. Summarily, we provide evidence that proteomic analysis of supporting tissues such as the pancreatic mesenchyme allows for the identification of potentially important factors guiding hESC to pancreas differentiation.

4.
OMICS ; 19(12): 754-6, 2015 Dec.
Article in English | MEDLINE | ID: mdl-26575978

ABSTRACT

Gene/disease associations are a critical part of exploring disease causes and ultimately cures, yet the publications that might provide such information are too numerous to be manually reviewed. We present a software utility, MOPED-Digger, that enables focused human assessment of literature by applying natural language processing (NLP) to search for customized lists of genes and diseases in titles and abstracts from biomedical publications. The results are ranked lists of gene/disease co-appearances and the publications that support them. Analysis of 18,159,237 PubMed title/abstracts yielded 1,796,799 gene/disease co-appearances that can be used to focus attention on the most promising publications for a possible gene/disease association. An integrated score is provided to enable assessment of broadly presented published evidence to capture more tenuous connections. MOPED-Digger is written in Java and uses Apache Lucene 5.0 library. The utility runs as a command-line program with a variety of user-options and is freely available for download from the MOPED 3.0 website (moped.proteinspire.org).


Subjects
Computational Biology/methods , Genetic Association Studies/methods , Genetic Predisposition to Disease , Software , Humans
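
The co-appearance counting described in the MOPED-Digger abstract above can be illustrated with a minimal sketch. This is not the MOPED-Digger implementation, which the abstract says is written in Java on Apache Lucene; it is a simplified Python stand-in that assumes plain case-insensitive whole-word matching and omits the integrated score. The function name `count_coappearances` and the sample records are hypothetical.

```python
import re
from collections import Counter
from itertools import product


def _mentions(terms, text):
    """Return the terms that occur as whole words in text (case-insensitive)."""
    lowered = text.lower()
    return [t for t in terms
            if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)]


def count_coappearances(records, genes, diseases):
    """Count gene/disease pairs that appear in the same title/abstract.

    records: iterable of (record_id, text) tuples (hypothetical input shape).
    Returns (counts, support): counts maps (gene, disease) to the number of
    records mentioning both; support maps the pair to those record ids.
    """
    counts = Counter()
    support = {}
    for record_id, text in records:
        found_genes = _mentions(genes, text)
        found_diseases = _mentions(diseases, text)
        for pair in product(found_genes, found_diseases):
            counts[pair] += 1
            support.setdefault(pair, []).append(record_id)
    return counts, support


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        ("rec1", "CHD8 mutations in autism spectrum disorder"),
        ("rec2", "LGALS1 expression in medulloblastoma and CHD8"),
        ("rec3", "CHD8 and autism risk"),
    ]
    counts, support = count_coappearances(
        records, ["CHD8", "LGALS1"], ["autism", "medulloblastoma"])
    # Ranked list of pairs, most frequently co-mentioned first.
    for pair, n in counts.most_common():
        print(pair, n, support[pair])
```

Sorting pairs by count gives a ranked list together with the supporting records, which is the shape of output the abstract describes; a real tool would also need synonym handling and a scoring scheme.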
5.
J Proteome Res ; 14(6): 2398-407, 2015 Jun 05.
Article in English | MEDLINE | ID: mdl-25877823

ABSTRACT

Although biological science discovery often involves comparing conditions to a normal state, in proteomics little is actually known about normal. Two Human Proteome studies featured in Nature offer new insights into protein expression and an opportunity to assess how high-throughput proteomics measures normal protein ranges. We use data from these studies to estimate technical and biological variability in protein expression and compare them to other expression data sets from normal tissue. Results show that measured protein expression across same-tissue replicates varies by ±4- to 10-fold for most proteins. Coefficients of variation (CV) for protein expression measurements range from 62% to 117% across different tissue experiments; however, adjusting for technical variation reduced this variability by as much as 50%. In addition, the CV could also be reduced by limiting comparisons to proteins with at least 3 unique peptide identifications, as the CV was on average 33% lower than for proteins with 2 or fewer peptide identifications. We also selected 13 housekeeping proteins and genes that were expressed across all tissues with low variability to determine their utility as a reference set for normalization and comparative purposes. These results present the first step toward estimating normal protein ranges by determining the variability in expression measurements through combining publicly available data. They support an approach that combines standard protocols with replicates of normal tissues to estimate normal protein ranges for large numbers of proteins and tissues. This would be a tremendous resource for normal cellular physiology and comparisons of proteomics studies.


Subjects
High-Throughput Screening Assays , Proteins/metabolism , Proteomics , Humans , Reference Values , Reproducibility of Results
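
For readers unfamiliar with the two variability measures quoted in the abstract above, the following is a minimal sketch of how a coefficient of variation and a fold range could be computed for one protein across replicates. It assumes the CV is the sample standard deviation divided by the mean, which is the usual definition but is not spelled out in the abstract; the function names and the replicate values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV as a percentage: 100 * sample standard deviation / mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100.0 * stdev(values) / m


def fold_range(values):
    """Return the ratio of the largest to the smallest measurement."""
    lowest = min(values)
    if lowest <= 0:
        raise ValueError("fold range needs strictly positive measurements")
    return max(values) / lowest


if __name__ == "__main__":
    # Made-up expression values for one protein in three same-tissue replicates.
    replicates = [10.0, 20.0, 30.0]
    print(f"CV = {coefficient_of_variation(replicates):.1f}%")
    print(f"fold range = {fold_range(replicates):.1f}x")
```

The abstract's comparison of proteins with at least 3 unique peptide identifications against those with 2 or fewer would amount to computing this CV separately for the two groups.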
6.
OMICS ; 19(4): 197-208, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25831060

ABSTRACT

Complex diseases are caused by a combination of genetic and environmental factors, creating a difficult challenge for diagnosis and defining subtypes. This review article describes how distinct disease subtypes can be identified through integration and analysis of clinical and multi-omics data. A broad shift toward molecular subtyping of disease using genetic and omics data has yielded successful results in cancer and other complex diseases. To determine molecular subtypes, patients are first classified by applying clustering methods to different types of omics data, then these results are integrated with clinical data to characterize distinct disease subtypes. An example of this molecular-data-first approach is in research on Autism Spectrum Disorder (ASD), a spectrum of social communication disorders marked by tremendous etiological and phenotypic heterogeneity. In the case of ASD, omics data such as exome sequences and gene and protein expression data are combined with clinical data such as psychometric testing and imaging to enable subtype identification. Novel ASD subtypes have been proposed, such as CHD8, using this molecular subtyping approach. Broader use of molecular subtyping in complex disease research is impeded by data heterogeneity, diversity of standards, and ineffective analysis tools. The future of molecular subtyping for ASD and other complex diseases calls for an integrated resource to identify disease mechanisms, classify new patients, and inform effective treatment options. This in turn will empower and accelerate precision medicine and personalized healthcare.


Subjects
Autism Spectrum Disorder/genetics , Genomics , Precision Medicine , Autism Spectrum Disorder/classification , Autism Spectrum Disorder/therapy , Cluster Analysis , Humans , Molecular Typing
7.
Nucleic Acids Res ; 43(Database issue): D1145-51, 2015 Jan.
Article in English | MEDLINE | ID: mdl-25404128

ABSTRACT

MOPED (Multi-Omics Profiling Expression Database; http://moped.proteinspire.org) has transitioned from solely a protein expression database to a multi-omics resource for human and model organisms. Through a web-based interface, MOPED presents consistently processed data for gene, protein and pathway expression. To improve data quality, consistency and use, MOPED includes metadata detailing experimental design and analysis methods. The multi-omics data are integrated through direct links between genes and proteins and further connected to pathways and experiments. MOPED now contains over 5 million records, information for approximately 75,000 genes and 50,000 proteins from four organisms (human, mouse, worm, yeast). These records correspond to 670 unique combinations of experiment, condition, localization and tissue. MOPED includes the following new features: pathway expression, Pathway Details pages, experimental metadata checklists, experiment summary statistics and more advanced searching tools. Advanced searching enables querying for genes, proteins, experiments, pathways and keywords of interest. The system is enhanced with visualizations for comparing across different data types. In the future MOPED will expand the number of organisms, increase integration with pathways and provide connections to disease.


Subjects
Databases, Genetic , Gene Expression Profiling , Proteomics , Animals , Humans , Internet , Mice , Proteins/genetics , Proteins/metabolism
8.
OMICS ; 18(12): 767-77, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25353146

ABSTRACT

Metabolomics in systems biology research unravels intracellular metabolic changes by high throughput methods, but such studies focusing on liver transplantation (LT) are limited. Microdialysate samples of liver grafts from donors after circulatory death (DCD; n=13) and brain death (DBD; n=27) during cold storage and post-reperfusion phase were analyzed through coulometric electrochemical array detection (CEAD) for identification of key metabolomics changes. Metabolite peak differences between the graft types at cold phase, post-reperfusion trends, and in failed allografts, were identified against reference chromatograms. In the cold phase, xanthine, uric acid, and kynurenine were overexpressed in DCD by 3-fold, and 3-nitrotyrosine (3-NT) and 4-hydroxy-3-methoxymandelic acid (HMMA) in DBD by 2-fold (p<0.05). In both grafts, homovanillic acid and methionine increased by 20%-30% with each 100 min increase in cold ischemia time (p<0.05). Uric acid expression was significantly different in DCD post-reperfusion. Failed allografts had overexpression of reduced glutathione and kynurenine (cold phase) and xanthine (post-reperfusion) (p<0.05). This differential expression of metabolites between graft types is a novel finding, and the identification of kynurenine overexpression in DCD grafts and in failed allografts is unique. Further studies should examine kynurenine as a potential biomarker predicting graft function, its causation, and actions on subsequent clinical outcomes.


Subjects
Biomarkers/metabolism , Liver Transplantation/methods , Metabolomics/methods , Homovanillic Acid/metabolism , Humans , Kynurenine/metabolism , Methionine/metabolism , Tyrosine/analogs & derivatives , Tyrosine/metabolism , Uric Acid/metabolism , Xanthine/metabolism
9.
Concurr Comput ; 26(13): 2112-2121, 2014 Sep 10.
Article in English | MEDLINE | ID: mdl-25313296

ABSTRACT

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

10.
OMICS ; 18(6): 335-43, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24910945

ABSTRACT

Multi-omics data-driven scientific discovery crucially rests on high-throughput technologies and data sharing. Currently, data are scattered across single omics repositories, stored in varying raw and processed formats, and are often accompanied by limited or no metadata. The Multi-Omics Profiling Expression Database (MOPED, http://moped.proteinspire.org ) version 2.5 is a freely accessible multi-omics expression database. Continual improvement and expansion of MOPED is driven by feedback from the Life Sciences Community. In order to meet the emergent need for an integrated multi-omics data resource, MOPED 2.5 now includes gene relative expression data in addition to protein absolute and relative expression data from over 250 large-scale experiments. To facilitate accurate integration of experiments and increase reproducibility, MOPED provides extensive metadata through the Data-Enabled Life Sciences Alliance (DELSA Global, http://delsaglobal.org ) metadata checklist. MOPED 2.5 has greatly increased the number of proteomics absolute and relative expression records to over 500,000, in addition to adding more than four million transcriptomics relative expression records. MOPED has an intuitive user interface with tabs for querying different types of omics expression data and new tools for data visualization. Summary information including expression data, pathway mappings, and direct connection between proteins and genes can be viewed on Protein and Gene Details pages. These connections in MOPED provide a context for multi-omics expression data exploration. Researchers are encouraged to submit omics data which will be consistently processed into expression summaries. MOPED as a multi-omics data resource is a pivotal public database, interdisciplinary knowledge resource, and platform for multi-omics understanding.


Subjects
Databases, Genetic , Gene Expression Profiling/methods , Software , Animals , Humans , Information Dissemination , Proteomics/methods
11.
OMICS ; 18(1): 10-4, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24456465

ABSTRACT

Biological processes are fundamentally driven by complex interactions between biomolecules. Integrated high-throughput omics studies enable multifaceted views of cells, organisms, or their communities. With the advent of new post-genomics technologies, omics studies are becoming increasingly prevalent; yet the full impact of these studies can only be realized through data harmonization, sharing, meta-analysis, and integrated research. These essential steps require consistent generation, capture, and distribution of metadata. To ensure transparency, facilitate data harmonization, and maximize reproducibility and usability of life sciences studies, we propose a simple common omics metadata checklist. The proposed checklist is built on the rich ontologies and standards already in use by the life sciences community. The checklist will serve as a common denominator to guide experimental design, capture important parameters, and be used as a standard format for stand-alone data publications. The omics metadata checklist and data publications will create efficient linkages between omics data and knowledge-based life sciences innovation and, importantly, allow for appropriate attribution to data generators and infrastructure science builders in the post-genomics era. We ask that the life sciences community test the proposed omics metadata checklist and data publications and provide feedback for their use and improvement.


Subjects
Information Dissemination/ethics , Metagenomics/statistics & numerical data , Research Design/standards , Data Mining , Humans , Metagenomics/economics , Metagenomics/trends , Publishing , Reproducibility of Results
12.
J Proteome Res ; 13(1): 107-13, 2014 Jan 03.
Article in English | MEDLINE | ID: mdl-24350770

ABSTRACT

The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm, and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration as well as relative (differential) expression data. MOPED provides a new updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project's efforts to generate chromosome- and diseases-specific proteomes by providing links from proteins to chromosome and disease information as well as many complementary resources. MOPED supports a new omics metadata checklist to harmonize data integration, analysis, and use. MOPED's development is driven by the user community, which spans 90 countries and guides future development that will transform MOPED into a multiomics resource. MOPED encourages users to submit data in a simple format. They can use the metadata checklist to generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries.


Subjects
Databases, Protein , Proteomics , Animals , Humans , User-Computer Interface
13.
PLoS Comput Biol ; 9(3): e1002967, 2013.
Article in English | MEDLINE | ID: mdl-23516350

ABSTRACT

Life science technologies generate a deluge of data that hold the keys to unlocking the secrets of important biological functions and disease mechanisms. We present DEAP, Differential Expression Analysis for Pathways, which capitalizes on information about biological pathways to identify important regulatory patterns from differential expression data. DEAP makes significant improvements over existing approaches by including information about pathway structure and discovering the most differentially expressed portion of the pathway. On simulated data, DEAP significantly outperformed traditional methods: with high differential expression, DEAP increased power by two orders of magnitude; with very low differential expression, DEAP doubled the power. DEAP performance was illustrated on two different gene and protein expression studies. DEAP discovered fourteen important pathways related to chronic obstructive pulmonary disease and interferon treatment that existing approaches omitted. On the interferon study, DEAP guided focus towards a four protein path within the 26 protein Notch signalling pathway.


Subjects
Computational Biology/methods , Gene Expression Profiling/methods , Models, Biological , Signal Transduction , Algorithms , Computer Simulation , Databases, Genetic , Disease/genetics , Humans , Reproducibility of Results
14.
Metabolites ; 3(3): 741-60, 2013 Sep 03.
Article in English | MEDLINE | ID: mdl-24958148

ABSTRACT

The integrative personal omics profile (iPOP) is a pioneering study that combines genomics, transcriptomics, proteomics, metabolomics and autoantibody profiles from a single individual over a 14-month period. The observation period includes two episodes of viral infection: a human rhinovirus and a respiratory syncytial virus. The profile studies give an informative snapshot into the biological functioning of an organism. We hypothesize that pathway expression levels are associated with disease status. To test this hypothesis, we use biological pathways to integrate metabolomics and proteomics iPOP data. The approach computes the pathways' differential expression levels at each time point, while taking into account the pathway structure and the longitudinal design. The resulting pathway levels show strong association with the disease status. Further, we identify temporal patterns in metabolite expression levels. The changes in metabolite expression levels also appear to be consistent with the disease status. The results of the integrative analysis suggest that changes in biological pathways may be used to predict and monitor the disease. The iPOP experimental design, data acquisition and analysis issues are discussed within the broader context of personal profiling.

15.
Big Data ; 1(1): 42-50, 2013 Mar.
Article in English | MEDLINE | ID: mdl-27447037

ABSTRACT

The life sciences have entered into the realm of big data and data-enabled science, where data can either empower or overwhelm. These data bring the challenges of the 5 Vs of big data: volume, veracity, velocity, variety, and value. Both independently and through our involvement with DELSA Global (Data-Enabled Life Sciences Alliance, DELSAglobal.org), the Kolker Lab ( kolkerlab.org ) is creating partnerships that identify data challenges and solve community needs. We specialize in solutions to complex biological data challenges, as exemplified by the community resource of MOPED (Model Organism Protein Expression Database, MOPED.proteinspire.org ) and the analysis pipeline of SPIRE (Systematic Protein Investigative Research Environment, PROTEINSPIRE.org ). Our collaborative work extends into the computationally intensive tasks of analysis and visualization of millions of protein sequences through innovative implementations of sequence alignment algorithms and creation of the Protein Sequence Universe tool (PSU). Pushing into the future together with our collaborators, our lab is pursuing integration of multi-omics data and exploration of biological pathways, as well as assigning function to proteins and porting solutions to the cloud. Big data have come to the life sciences; discovering the knowledge in the data will bring breakthroughs and benefits.

16.
Big Data ; 1(4): 196-201, 2013 Dec.
Article in English | MEDLINE | ID: mdl-27447251

ABSTRACT

Biological processes are fundamentally driven by complex interactions between biomolecules. Integrated high-throughput omics studies enable multifaceted views of cells, organisms, or their communities. With the advent of new post-genomics technologies, omics studies are becoming increasingly prevalent; yet the full impact of these studies can only be realized through data harmonization, sharing, meta-analysis, and integrated research. These essential steps require consistent generation, capture, and distribution of metadata. To ensure transparency, facilitate data harmonization, and maximize reproducibility and usability of life sciences studies, we propose a simple common omics metadata checklist. The proposed checklist is built on the rich ontologies and standards already in use by the life sciences community. The checklist will serve as a common denominator to guide experimental design, capture important parameters, and be used as a standard format for stand-alone data publications. The omics metadata checklist and data publications will create efficient linkages between omics data and knowledge-based life sciences innovation and, importantly, allow for appropriate attribution to data generators and infrastructure science builders in the post-genomics era. We ask that the life sciences community test the proposed omics metadata checklist and data publications and provide feedback for their use and improvement.

17.
Big Data ; 1(4): 237-44, 2013 Dec.
Article in English | MEDLINE | ID: mdl-27447256

ABSTRACT

Children with special healthcare needs (CSHCN) require health and related services that exceed those required by most hospitalized children. A small but growing and important subset of the CSHCN group includes medically complex children (MCCs). MCCs typically have comorbidities and disproportionately consume healthcare resources. To enable strategic planning for the needs of MCCs, simple screens to identify potential MCCs rapidly in a hospital setting are needed. We assessed whether the number of medications used and the class of those medications correlated with MCC status. Retrospective analysis of medication data from the inpatients at Seattle Children's Hospital found that the numbers of inpatient and outpatient medications significantly correlated with MCC status. Numerous variables based on counts of medications, use of individual medications, and use of combinations of medications were considered, resulting in a simple model based on three different counts of medications: outpatient and inpatient drug classes and individual inpatient drug names. The combined model was used to rank the patient population for medical complexity. As a result, simple, objective admission screens for predicting the complexity of patients based on the number and type of medications were implemented.

18.
Nucleic Acids Res ; 40(Database issue): D1093-9, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22139914

ABSTRACT

Large numbers of mass spectrometry proteomics studies are being conducted to understand all types of biological processes. The size and complexity of proteomics data hinder efforts to easily share, integrate, query and compare the studies. The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is a new and expanding proteomics resource that enables rapid browsing of protein expression information from publicly available studies on humans and model organisms. MOPED is designed to simplify the comparison and sharing of proteomics data for the greater research community. MOPED uniquely provides protein level expression data, meta-analysis capabilities and quantitative data from standardized analysis. Data can be queried for specific proteins, browsed based on organism, tissue, localization and condition and sorted by false discovery rate and expression. MOPED empowers users to visualize their own expression data and compare it with existing studies. Further, MOPED links to various protein and pathway databases, including GeneCards, Entrez, UniProt, KEGG and Reactome. The current version of MOPED contains over 43,000 proteins with at least one spectral match and more than 11 million high certainty spectra.


Subjects
Databases, Protein , Proteins/metabolism , Animals , Humans , Mass Spectrometry , Mice , Models, Animal , Proteomics , User-Computer Interface
19.
OMICS ; 15(7-8): 513-21, 2011.
Article in English | MEDLINE | ID: mdl-21809957

ABSTRACT

To address the monumental challenge of assigning function to millions of sequenced proteins, we completed the first-of-a-kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences, resulting in over 1 million proteins being assigned to existing KOG groups and the remainder clustered into 100,000 functional groups.


Subjects
Proteins/classification , Databases, Protein , Proteins/chemistry , Proteins/metabolism
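
The single-linkage step mentioned in the abstract above can be sketched in a few lines. This is an in-memory illustration only, not the authors' Hadoop/MapReduce implementation. It assumes pairwise similarity scores are already available (the abstract does not give the formula for the length normalized bit score, so scores are taken as plain numbers) and that two proteins are linked when their score meets a chosen threshold. Under that assumption, single-linkage clusters are the connected components of the link graph, found here with a union-find structure. The function name, threshold, and sample data are hypothetical.

```python
def single_linkage_clusters(items, scored_pairs, threshold):
    """Group items into single-linkage clusters.

    items: iterable of hashable identifiers.
    scored_pairs: iterable of (a, b, score) similarity records.
    threshold: pairs with score >= threshold are linked (assumed cut-off rule).
    Returns a list of sets, ordered by their sorted members.
    """
    items = list(items)
    parent = {item: item for item in items}

    def find(x):
        # Walk to the root, shortening the path as we go (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), set()).add(item)
    return sorted(clusters.values(), key=sorted)


if __name__ == "__main__":
    # Made-up proteins and similarity scores for illustration only.
    proteins = ["p1", "p2", "p3", "p4"]
    pairs = [("p1", "p2", 0.9), ("p2", "p3", 0.8), ("p3", "p4", 0.2)]
    print(single_linkage_clusters(proteins, pairs, threshold=0.5))
```

Note the chaining behavior typical of single linkage: p1 and p3 end up in the same cluster through p2 even though no direct score between them is given.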
20.
J Proteomics ; 75(1): 116-21, 2011 Dec 10.
Article in English | MEDLINE | ID: mdl-21718813

ABSTRACT

In high-throughput mass spectrometry proteomics, peptides and proteins are not simply identified as present or not present in a sample, rather the identifications are associated with differing levels of confidence. The false discovery rate (FDR) has emerged as an accepted means for measuring the confidence associated with identifications. We have developed the Systematic Protein Investigative Research Environment (SPIRE) for the purpose of integrating the best available proteomics methods. Two successful approaches to estimating the FDR for MS protein identifications are the MAYU and our current SPIRE methods. We present here a method to combine these two approaches to estimating the FDR for MS protein identifications into an integrated protein model (IPM). We illustrate the high quality performance of this IPM approach through testing on two large publicly available proteomics datasets. MAYU and SPIRE show remarkable consistency in identifying proteins in these datasets. Still, IPM results in a more robust FDR estimation approach and additional identifications, particularly among low abundance proteins. IPM is now implemented as a part of the SPIRE system.


Subjects
High-Throughput Screening Assays/methods , Proteins/analysis , Proteomics/methods , Databases, Protein , False Positive Reactions , Mass Spectrometry/methods , Models, Chemical , Proteins/chemistry